line break
- Leisure & Entertainment > Sports > Martial Arts (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- (13 more...)
so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs
Bhyravajjula, Sriharsh, Walsh, Melanie, Preus, Anna, Antoniak, Maria
Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Iowa (0.04)
- North America > United States > Indiana (0.04)
- (14 more...)
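The abstract's final claim, that different text-processing methods yield significantly different representations of whitespace, can be made concrete with a short sketch. The poem fragment and the normalization step below are invented illustrations, not the paper's actual pipeline:

```python
# A minimal illustration of how a common pretraining-style cleaning step
# (collapsing all whitespace runs) erases the poetic form that the paper
# treats as a semantic and spatial feature. The poem text is invented.
poem = "so much depends\nupon\n\n    a red wheel\n    barrow"

def whitespace_profile(text):
    """Count the whitespace features that carry poetic form."""
    lines = text.split("\n")
    return {
        "newlines": text.count("\n"),
        "blank_lines": sum(1 for line in lines if not line.strip()),
        "indented_lines": sum(1 for line in lines if line.startswith((" ", "\t"))),
    }

# Typical normalization: split on any whitespace, rejoin with single spaces.
normalized = " ".join(poem.split())

print(whitespace_profile(poem))        # line breaks, stanza break, indentation present
print(whitespace_profile(normalized))  # all structure collapsed away
```

Any corpus assembled with the second representation can no longer distinguish a prose sentence from an indented, stanza-broken poem.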
Analysis of Line Break prediction models for detecting defensive breakthrough in football
Yagi, Shoma, Ichikawa, Jun, Ichinose, Genki
In football, attacking teams attempt to break through the opponent's defensive line to create scoring opportunities. This action, known as a Line Break, is a critical indicator of offensive effectiveness and tactical performance, yet previous studies have mainly focused on shots or goal opportunities rather than on how teams break the defensive line. In this study, we develop a machine learning model to predict Line Breaks using event and tracking data from the 2023 J1 League season. The model incorporates 189 features, including player positions, velocities, and spatial configurations, and employs an XGBoost classifier to estimate the probability of Line Breaks. The proposed model achieved high predictive accuracy, with an AUC of 0.982 and a Brier score of 0.015. Furthermore, SHAP analysis revealed that factors such as offensive player speed, gaps in the defensive line, and offensive players' spatial distributions significantly contribute to the occurrence of Line Breaks. Finally, we found a moderate positive correlation between the predicted probability of being Line-Broken and the number of shots and crosses conceded at the team level. These results suggest that Line Breaks are closely linked to the creation of scoring opportunities and provide a quantitative framework for understanding tactical dynamics in football.
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.05)
- Asia > Japan > Honshū > Chūbu > Shizuoka Prefecture > Shizuoka (0.04)
- (5 more...)
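Two of the feature families the SHAP analysis above highlights, offensive player speed and gaps in the defensive line, are simple to compute from tracking data. The sketch below is a toy illustration with invented coordinates, not the paper's 189-feature pipeline:

```python
# Toy feature extraction for Line Break prediction: attacker speed from two
# consecutive frames, and the widest lateral gap in a back four. All player
# positions, the frame interval, and the pitch coordinates are invented.
import math

# (x, y) positions of the four defenders forming the defensive line
defenders = [(30.0, 5.0), (30.5, 25.0), (29.5, 45.0), (30.0, 65.0)]

# attacker position at two consecutive frames, 0.1 s apart
attacker_t0, attacker_t1 = (25.0, 30.0), (25.8, 30.2)
dt = 0.1

def speed(p0, p1, dt):
    """Straight-line speed between two frames, in m/s."""
    return math.dist(p0, p1) / dt

def widest_gap(defenders):
    """Largest lateral distance between adjacent defenders in the line."""
    ys = sorted(y for _, y in defenders)
    return max(b - a for a, b in zip(ys, ys[1:]))

features = {
    "attacker_speed": speed(attacker_t0, attacker_t1, dt),
    "defensive_line_gap": widest_gap(defenders),
}
print(features)
```

In the paper these kinds of features feed an XGBoost classifier; here the point is only that each feature is a cheap geometric function of a single frame (or frame pair) of tracking data.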
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kuznetsov, Kristian, Kushnareva, Laida, Druzhinina, Polina, Razzhigaev, Anton, Voznyuk, Anastasia, Piontkovskaya, Irina, Burnaev, Evgeny, Barannikov, Serguei
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAEs) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
- North America > United States (0.93)
- Atlantic Ocean (0.47)
- Europe (0.28)
- (2 more...)
- Media (0.68)
- Health & Medicine > Therapeutic Area (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
- Education > Educational Setting (0.46)
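The core SAE operation the abstract relies on, projecting a residual-stream activation through an overcomplete ReLU encoder so that only a few features fire, fits in a few lines. This is a generic sketch with hand-picked toy weights, not the paper's Gemma-2-2b setup:

```python
# Minimal sparse-autoencoder encoder pass: features = ReLU(W_enc @ x + b_enc).
# The 3-dimensional "residual stream" vector and the 5-feature dictionary
# below are invented toy values chosen so that most features stay at zero.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x = [1.0, -0.5, 0.25]          # toy residual-stream activation
W_enc = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 1.0],
]
b_enc = [-0.1] * 5             # negative bias pushes weak features to zero

features = relu([h + b for h, b in zip(matvec(W_enc, x), b_enc)])
active = [i for i, f in enumerate(features) if f > 0]
print(features, active)
```

The sparsity is the point: because only a handful of the overcomplete dictionary's features are active for any given input, each active feature can be inspected and interpreted individually, which is what enables the domain statistics and steering analyses described above.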
Predicting Punctuation in Ancient Chinese Texts: A Multi-Layered LSTM and Attention-Based Approach
Cai, Tracy, Chang, Kimmy, Nabi, Fahad
Many ancient Chinese texts contain thousands of lines with no distinct punctuation marks or delimiters in sight. The lack of punctuation in such texts makes it difficult for humans to identify when there are pauses or breaks between particular phrases and to understand the semantic meaning of the written text (Mogahed, 2012). As a result, unless one was educated in the ancient time period, many readers of ancient Chinese would have significantly different interpretations of the texts. We propose an approach to predict the location (and type) of punctuation in ancient Chinese texts that extends the work of Oh et al. (2017) by leveraging a bidirectional multi-layered LSTM with a multi-head attention mechanism, as inspired by Luong et al.'s (2015) discussion of attention-based architectures. We find that the use of multi-layered LSTMs and multihead [...]
In fact, previous approaches have experimented with Encoder-Decoder RNNs, GRUs, and LSTMs, as well as different single-headed attention structures (local and global), to successfully conduct language translation tasks. One recent work that built an efficient model for optimal performance in a task similar to ours (predicting line breaks) is that of Oh et al. (2017), in which researchers were able to predict where line breaks ought to be in Hanmun (a punctuation-lacking Korean script) with a multi-layered LSTM model that incorporated an end-of-sentence attention mechanism. As Luong et al. (2015) found local attention models to significantly outperform non-attentional ones on English-German translation tasks, we were inspired to improve upon Oh et al.'s approach to line-break prediction by paying special attention to the attention model.
- North America > United States > California > Santa Clara County > Palo Alto (0.05)
- Asia > China (0.04)
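The data-preparation side of the task above, turning a punctuated reference text into per-character training labels, is straightforward to sketch. The example sentence, punctuation set, and label names below are illustrative and not taken from the paper:

```python
# Punctuation restoration framed as per-character labeling: strip the marks
# from a punctuated reference text and label each remaining character with
# the mark (if any) that followed it. A model such as the paper's
# LSTM-with-attention can then be trained to predict these labels.
PUNCT = {"，": "comma", "。": "stop"}

def to_labeled_chars(text):
    chars, labels = [], []
    for ch in text:
        if ch in PUNCT:
            labels[-1] = PUNCT[ch]  # attach the mark to the preceding character
        else:
            chars.append(ch)
            labels.append("none")
    return chars, labels

chars, labels = to_labeled_chars("學而時習之，不亦說乎。")
print(list(zip(chars, labels)))
```

At inference time the model receives only the unpunctuated character sequence and predicts a label per position, which is exactly the "location (and type)" formulation the abstract describes.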
Lyrics Transcription for Humans: A Readability-Aware Benchmark
Cífka, Ondřej, Schreiber, Hendrik, Miner, Luke, Stöter, Fabian-Robert
Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Rhode Island (0.04)
- Europe > Greece (0.04)
- (6 more...)
- Media > Music (0.88)
- Leisure & Entertainment (0.66)
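The words-only evaluation the benchmark moves beyond can be demonstrated directly: under a standard word error rate (WER), two transcripts that differ only in punctuation, casing, and line breaks are indistinguishable. This is a generic sketch, not the benchmark's metric code, and the lyric fragment is invented:

```python
# Word error rate over word tokens only: Levenshtein distance divided by
# reference length. Formatting differences vanish in the tokenization step.
def wer(ref_words, hyp_words):
    """Levenshtein distance over word tokens / reference length."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref_words)

def words_only(text):
    """Strip punctuation and casing, keep only word tokens."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

ref = "Hello, darkness,\nmy old friend"
hyp = "hello darkness my old friend"
print(wer(words_only(ref), words_only(hyp)))  # 0.0: formatting loss is invisible
```

A metric that scores 0.0 here cannot reward a system for recovering the line break or the commas, which is the gap the readability-aware metrics above are designed to close.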
Shimo Lab at "Discharge Me!": Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections
He, Yunzhen, Yamagiwa, Hiroaki, Shimodaira, Hidetoshi
In this paper, we present our approach to the shared task "Discharge Me!" at the BioNLP Workshop 2024. The primary goal of this task is to reduce the time and effort clinicians spend on writing detailed notes in the electronic health record (EHR). Participants develop a pipeline to generate the "Brief Hospital Course" and "Discharge Instructions" sections from the EHR. Our approach involves a first step of extracting the relevant sections from the EHR. We then add explanatory prompts to these sections and concatenate them with separate tokens to create the input text. To train a text generation model, we perform LoRA fine-tuning on the ClinicalT5-large model. On the final test data, our approach achieved a ROUGE-1 score of $0.394$, which is comparable to the top solutions.
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- North America > United States > Pennsylvania (0.04)
- (8 more...)
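The input-construction step described above, adding explanatory prompts to the extracted EHR sections and concatenating them with separator tokens, can be sketched in a few lines. The section names, prompt wording, and separator token below are illustrative assumptions, not the authors' exact choices:

```python
# Prompt-driven concatenation of EHR sections into a single model input.
SEP = "<sec>"  # illustrative separator token, not necessarily the paper's

def build_input(sections):
    """Prefix each extracted section with an explanatory prompt, then join."""
    parts = [
        f"The following is the {name} section: {text}"
        for name, text in sections.items()
    ]
    return f" {SEP} ".join(parts)

sections = {
    "Chief Complaint": "chest pain",
    "History of Present Illness": "3 days of intermittent chest pain",
}
model_input = build_input(sections)
print(model_input)
```

The resulting string is what the LoRA-fine-tuned ClinicalT5-large model would consume as its source text for generating the target section.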
Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark
Cífka, Ondřej, Dimitriou, Constantinos, Wang, Cheng-i, Schreiber, Hendrik, Miner, Luke, Stöter, Fabian-Robert
Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics including formatting and punctuation, which leads to a potential misalignment with the creative products of musicians and songwriters as well as listeners' experiences. For example, line breaks are important in conveying information about rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. Firstly, a complete revision of the transcripts, geared specifically towards ALT evaluation by following a newly created annotation guide that unifies the music industry's guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Secondly, a suite of evaluation metrics designed, unlike the traditional word error rate, to capture such phenomena. We hope that the proposed benchmark contributes to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.
- North America > United States > Rhode Island (0.04)
- Europe > Greece (0.04)
- Europe > United Kingdom > England > East Sussex > Brighton (0.04)
- (4 more...)
- Media > Music (1.00)
- Leisure & Entertainment (0.67)
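One way to score line breaks separately from word content, in the spirit of the metric suite described above though not the benchmark's actual implementation, is to compare the word positions after which each transcript places a break. The lyric fragment below is invented:

```python
# Toy line-break F1: when reference and hypothesis contain the same word
# sequence, compare the sets of word indices that end a line.
def break_positions(text):
    """Indices of words that end a line."""
    positions, i = set(), 0
    for line in text.split("\n"):
        i += len(line.split())
        positions.add(i - 1)
    return positions

def break_f1(ref, hyp):
    r, h = break_positions(ref), break_positions(hyp)
    tp = len(r & h)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(h), tp / len(r)
    return 2 * prec * rec / (prec + rec)

ref = "hello darkness\nmy old friend"
hyp = "hello darkness my\nold friend"
print(break_f1(ref, hyp))  # 0.5: one of two breaks placed correctly
```

Unlike plain WER, this kind of score distinguishes a transcript with correct line structure from one with the same words arbitrarily wrapped, which matters for the subtitle and karaoke use cases the abstract mentions.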